DESCRIPTION

A METHOD FOR RECOVERING TARGET SPEECH BASED ON SPEECH
SEGMENT DETECTION UNDER A STATIONARY NOISE

CROSS REFERENCE TO RELATED APPLICATIONS 

This application claims priority under 35 U.S.C. 119 based upon Japanese
Patent Application No. 2003-314247, filed on September 5, 2003. The entire disclosure
of the aforesaid application is incorporated herein by reference.

BACKGROUND OF THE INVENTION 

1. Field of the Invention

The present invention relates to a method for recovering target speech based
on speech segment detection under a stationary noise by extracting signal components
falling in a speech segment, which is determined based on separated signals obtained
through the Independent Component Analysis (ICA), thereby minimizing the residual
noise in the recovered target speech.

2. Description of the Related Art 

Recently, speech recognition technology has improved significantly, and speech
recognition engines now achieve extremely high recognition capabilities in ideal
environments, i.e. with no surrounding noise. However, it is still difficult to attain a
desirable recognition rate in household environments or offices where there are sounds
of daily activities and the like. In order to take advantage of the inherent capability of
the speech recognition engine in such environments, pre-processing is needed to remove
noises from the mixed signals and pass only the target speech, such as a speaker's
speech, to the engine.

In this respect, the ICA and other speech emphasizing methods have been
widely utilized and various algorithms have been proposed. (For example, see the
following five references: 1. "An Information Maximization Approach to Blind
Separation and Blind Deconvolution", by J. Bell and T. J. Sejnowski, Neural
Computation, USA, MIT Press, June 1995, Vol. 7, No. 6, pp. 1129-1159; 2. "Natural
Gradient Works Efficiently in Learning", by S. Amari, Neural Computation, USA, MIT
Press, February 1998, Vol. 10, No. 2, pp. 254-276; 3. "Independent Component Analysis
Using an Extended Infomax Algorithm for Mixed Sub-Gaussian and Super-Gaussian
Sources", by T. W. Lee, M. Girolami, and T. J. Sejnowski, Neural Computation, USA,
MIT Press, February 1999, Vol. 11, No. 2, pp. 417-441; 4. "Fast and Robust
Fixed-Point Algorithms for Independent Component Analysis", by A. Hyvarinen, IEEE
Trans. Neural Networks, USA, IEEE, June 1999, Vol. 10, No. 3, pp. 626-634; and 5.
"Independent Component Analysis: Algorithms and Applications", by A. Hyvarinen
and E. Oja, Neural Networks, USA, Pergamon Press, June 2000, Vol. 13, No. 4-5, pp.
411-430.) Among various algorithms, the ICA is a method for separating noises from
speech on the assumption that the sound sources are statistically independent.

Although the ICA is capable of separating noises from speech well under
ideal conditions without reverberation, its separation ability greatly degrades under
real-life conditions with strong reverberation due to residual noises caused by the
reverberation.

SUMMARY OF THE INVENTION 

In view of the above situations, the objective of the present invention is to
provide a method for recovering target speech from signals received in a real-life
environment. Based on the separated signals obtained through the ICA, a speech
segment and a noise segment are defined. Thereafter, signal components falling in the
speech segment are extracted so as to minimize the residual noise in the recovered
target speech.

According to a first aspect of the present invention, the method for recovering
target speech based on speech segment detection under a stationary noise comprises:
the first step of receiving target speech emitted from a sound source and a noise emitted
from another sound source and forming mixed signals at a first microphone and at a
second microphone, which are provided at separate locations, performing the Fourier
transform of the mixed signals from the time domain to the frequency domain, and
extracting estimated spectra Y* and Y corresponding to the target speech and the noise
by use of the Independent Component Analysis; the second step of separating the
estimated spectra Y* into an estimated spectrum series group y* in which the noise is
removed and an estimated spectrum series group y in which the noise remains by
applying separation judgment criteria based on the kurtosis of the amplitude
distribution of each of the estimated spectrum series in Y*; the third step of detecting a
speech segment and a noise segment in the frame-number domain of the total sum F of
all the estimated spectrum series in y* by applying detection judgment criteria based on
a predetermined threshold value β that is determined by the maximum value of F; and
the fourth step of extracting components falling in the speech segment from each of the
estimated spectrum series in Y* to generate a recovered spectrum group of the target
speech, and performing the inverse Fourier transform of the recovered spectrum group
from the frequency domain to the time domain to generate a recovered signal of the
target speech.

The target speech and noise signals received at the first and second
microphones are mixed and convoluted. By transforming the signals from the time
domain to the frequency domain, the convoluted mixing can be treated as instant
mixing, making the separation procedure relatively easy. In addition, the sound sources
are considered to be statistically independent; thus, the ICA can be employed.

Since split spectra obtained through the ICA contain scaling ambiguity and
permutation at each frequency, it is necessary to solve these problems first in order to
extract the estimated spectra Y* and Y corresponding to the target speech and the noise
respectively. Even after that, the estimated spectra Y* at some frequencies still contain
the noise.

There is a well-known difference in statistical characteristics between speech
and a noise in the time domain. That is, the amplitude distribution of speech has a high
kurtosis with a high probability of occurrence around 0, whereas the amplitude
distribution of a noise has a low kurtosis. The same characteristics are expected to be
observed even after performing the Fourier transform of the speech and noise signals
from the time domain to the frequency domain. At each frequency, a plurality of
components form a spectrum series according to the frame number used for
discretization. Therefore, by examining the kurtosis of the amplitude distribution of the
estimated spectrum series in Y* at one frequency, it can be judged that, if the kurtosis
is high, the noise is well removed at the frequency; and if the kurtosis is low, the noise
still remains at the frequency. Consequently, each spectrum series in Y* can be
assigned to either the estimated spectrum series group y* or y.
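The contrast between a peaked, speech-like amplitude distribution and a flat, noise-like one can be illustrated with a minimal sketch. It assumes the usual definition of kurtosis as the fourth central moment divided by the fourth power of the standard deviation; the function name `kurtosis` and the two sample series are hypothetical and are not taken from the source.

```python
def kurtosis(values):
    """Fourth central moment divided by sigma^4 (the squared variance)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n   # variance (sigma^2)
    m4 = sum((v - mean) ** 4 for v in values) / n   # fourth central moment
    return m4 / (m2 ** 2)

# Hypothetical amplitude series over ten frames (illustrative values only).
speech_like = [0, 0, 0, 0, 0, 0, 0, 0, 0, 10]   # mostly near 0, one large value
noise_like = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]    # spread evenly over the range

k_speech = kurtosis(speech_like)
k_noise = kurtosis(noise_like)
```

In this toy case the series concentrated around 0 yields the larger kurtosis, which is the property the judgment above relies on.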

Since the frequency components of a speech signal vary with time, the
frame-number range characterizing speech varies from one estimated spectrum series to
another in y*. By taking a summation of all the estimated spectrum series in y* at each
frame number and by specifying a threshold value β depending on the maximum value
of the total sum F, the speech segment and the noise segment can be clearly defined in
the frame-number domain.

Therefore, noise components are practically non-existent in the recovered
spectrum group, which is generated by extracting components falling in the speech
segment from the estimated spectra Y*. The target speech is thus obtained by
performing the inverse Fourier transform of the recovered spectrum group from the
frequency domain to the time domain.

It is preferable that the detection judgment criteria define the speech segment
as a frame-number range where the total sum F is greater than the threshold value β and
the noise segment as a frame-number range where the total sum F is less than or equal
to the threshold value β. Accordingly, a speech segment detection function, which is a
two-valued function for selecting either the speech segment or the noise segment
depending on the threshold value β, can be defined. By use of this function,
components falling in the speech segment can be easily extracted.
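The two-valued detection function described above can be sketched as follows. This is a minimal illustration under stated assumptions: the total sum F is taken as the sum of absolute amplitudes over all series at each frame, and the threshold is set to a fixed fraction `ratio` of the maximum of F. Both choices, and all function names, are hypothetical; the passage only states that the threshold is determined by the maximum value of F.

```python
def detect_speech_frames(clean_group, ratio=0.1):
    """Return (F, threshold, detection) for a list of spectrum series.

    clean_group: list of series, each a list of (possibly complex) values
                 indexed by frame number.
    ratio:       assumed fraction of max(F) used as the threshold.
    """
    n_frames = len(clean_group[0])
    # Total sum F at each frame number (assumption: sum of absolute values).
    total = [sum(abs(series[k]) for series in clean_group)
             for k in range(n_frames)]
    threshold = ratio * max(total)
    # Two-valued function: 1 in the speech segment, 0 in the noise segment.
    detection = [1 if f > threshold else 0 for f in total]
    return total, threshold, detection

def extract_speech_components(series, detection):
    """Keep components in the speech segment and zero the others."""
    return [value * flag for value, flag in zip(series, detection)]

# Hypothetical group of two series over four frames.
group = [[0.1, 2.0, 3.0, 0.2],
         [0.1, 1.0, 2.0, 0.1]]
F, threshold, detection = detect_speech_frames(group)
recovered = extract_speech_components(group[0], detection)
```

Multiplying each series by the detection function is one simple way to extract the components falling in the speech segment.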

According to a second aspect of the present invention, the method for
recovering target speech based on speech segment detection under a stationary noise
comprises: the first step of receiving target speech emitted from a sound source and a
noise emitted from another sound source and forming mixed signals at a first
microphone and at a second microphone, which are provided at separate locations,
performing the Fourier transform of the mixed signals from the time domain to the
frequency domain, and extracting estimated spectra Y* and Y corresponding to the
target speech and the noise by use of the Independent Component Analysis; the second
step of separating the estimated spectra Y* into an estimated spectrum series group y*
in which the noise is removed and an estimated spectrum series group y in which the
noise remains by applying separation judgment criteria based on the kurtosis of the
amplitude distribution of each of the estimated spectrum series in Y*; the third step of
detecting a speech segment and a noise segment in the time domain of the total sum F
of all the estimated spectrum series in y* by applying detection judgment criteria based
on a predetermined threshold value β that is determined by the maximum value of F;
and the fourth step of performing the inverse Fourier transform of the estimated spectra
Y* from the frequency domain to the time domain to generate a recovered signal of the
target speech and extracting components falling in the speech segment from the
recovered signal of the target speech to recover the target speech.

At each frequency, a plurality of components form a spectrum series
according to the frame number used for discretization. There is a one-to-one
relationship between the frame number and the sampling time via the frame interval.
By use of this relationship, the speech segment detected in the frame-number domain
can be converted to the corresponding speech segment in the time domain. The other
time interval can be defined as the noise segment. The target speech can thus be
recovered by performing the inverse Fourier transform of the estimated spectra Y*
from the frequency domain to the time domain to generate the recovered signal of the
target speech and extracting components falling in the speech segment from the
recovered signal in the time domain.
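The conversion from the frame-number domain to the time domain can be sketched as below. The sketch assumes that frame k covers the samples from k times the frame interval up to, but not including, (k+1) times the frame interval, which ignores any overlap between frames; the function name and the sample values are hypothetical.

```python
def frames_to_sample_mask(detection, frame_interval, n_samples):
    """Expand a per-frame 0/1 detection into a per-sample 0/1 mask.

    Assumption: frame k starts at sample k * frame_interval and is
    responsible for the next frame_interval samples.
    """
    mask = [0] * n_samples
    for k, flag in enumerate(detection):
        if flag:
            start = k * frame_interval
            for t in range(start, min(start + frame_interval, n_samples)):
                mask[t] = 1
    return mask

# Hypothetical detection over four frames, two samples per frame interval.
detection = [0, 1, 1, 0]
mask = frames_to_sample_mask(detection, frame_interval=2, n_samples=8)

# Extract the speech segment from a hypothetical recovered time signal.
signal = [0.3, -0.2, 1.0, -1.5, 2.0, 0.5, 0.1, -0.1]
speech_only = [x * m for x, m in zip(signal, mask)]
```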

It is preferable that the detection judgment criteria define the speech segment
as a time interval where the total sum F is greater than the threshold value β and the
noise segment as a time interval where the total sum F is less than or equal to the
threshold value β. Accordingly, a speech segment detection function, which is a
two-valued function for selecting either the speech segment or the noise segment
depending on the threshold value β, can be defined. By use of this function, components
falling in the speech segment can be easily extracted.

It is preferable, in both the first and second aspects of the present invention,
that the kurtosis of the amplitude distribution of each of the estimated spectrum series
in Y* is evaluated by means of entropy E of the amplitude distribution. The entropy E
can be used for quantitatively evaluating the uncertainty of the amplitude distribution
of each of the estimated spectrum series in Y*. In this case, the entropy E decreases as
the noise is removed. Incidentally, for a quantitative measure of the kurtosis, μ/σ^4 may
be used, where μ is the fourth moment around the mean and σ is the standard deviation.
However, it is not preferable to use this measure because of its non-robustness in the
presence of outliers. Statistically, a kurtosis is defined as the fourth order statistics as
above. On the other hand, entropy is expressed as the weighted summation of all the
moments (0th, 1st, 2nd, 3rd, ...) by the Taylor expansion. Therefore, entropy is a
statistical measure that contains a kurtosis as its part.

It is preferable, in both the first and second aspects of the present invention,
that the separation judgment criteria are given as:

(1) if the entropy E of an estimated spectrum series in Y* is less than a
predetermined threshold value α, the estimated spectrum series in Y* is assigned to
the estimated spectrum series group y*; and

(2) if the entropy E of an estimated spectrum series in Y* is greater than or equal
to the threshold value α, the estimated spectrum series in Y* is assigned to the
estimated spectrum series group y.

The noise is well removed from the estimated spectrum series in Y* at some
frequencies, but not from the others. Therefore, the entropy varies with ω. If the
entropy E of an estimated spectrum series in Y* is less than the threshold value α, the
estimated spectrum series in Y* is assigned to the estimated spectrum series group y*
in which the noise is removed; and if the entropy E of an estimated spectrum series in
Y* is greater than or equal to the threshold value α, the estimated spectrum series in Y*
is assigned to the estimated spectrum series group y in which the noise remains.

Based on the separation judgment criteria, which determine the selection of
y* or y depending on α, it is easy to separate Y* into y* and y.
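The entropy-based separation can be sketched as follows. This is only an illustration under assumptions: the entropy is computed as the Shannon entropy of a histogram of amplitude values with a fixed number of bins, and the threshold value passed as `alpha` is arbitrary. The passage does not specify the histogram, the bin count, or a numerical threshold, so `amplitude_entropy`, `separate_by_entropy`, `n_bins` and the sample data are all hypothetical.

```python
import math

def amplitude_entropy(series, n_bins=10):
    """Shannon entropy of the histogram of the amplitude values in a series."""
    lo, hi = min(series), max(series)
    counts = [0] * n_bins
    for a in series:
        if hi > lo:
            index = min(int((a - lo) / (hi - lo) * n_bins), n_bins - 1)
        else:
            index = 0                      # constant series: a single bin
        counts[index] += 1
    n = len(series)
    return -sum(c / n * math.log(c / n) for c in counts if c)

def separate_by_entropy(series_by_frequency, alpha):
    """Criteria (1) and (2): E < alpha goes to y*, otherwise to y."""
    noise_removed, noise_remaining = {}, {}
    for frequency, series in series_by_frequency.items():
        if amplitude_entropy(series) < alpha:
            noise_removed[frequency] = series      # group y*
        else:
            noise_remaining[frequency] = series    # group y
    return noise_removed, noise_remaining

# Hypothetical series at two frequency indices (illustrative values only).
Y_star = {0: [0, 0, 0, 0, 0, 0, 0, 0, 0, 10],   # peaked: low entropy
          1: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}   # flat: high entropy
y_clean, y_noisy = separate_by_entropy(Y_star, alpha=1.0)
```

Because only a histogram and one comparison per series are needed, the selection stays cheap even when it is repeated at every frequency.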

According to the present invention as described in Claims 1, 2, 5, and 6, it is
possible to extract signal components falling only in the speech segment, which is
determined from the estimated spectra corresponding to the target speech, from the
received signals under real-life conditions. Thus, the residual noise can be minimized to
recover target speech with high quality. As a result, input operations by means of
speech recognition in a noisy environment, such as voice commands or input for OA,
for storage management in logistics, and for operating car navigation systems, may be
able to replace the conventional input operations by use of fingers, touch sensors, or
keyboards.

According to the present invention as described in Claim 2, it is possible to
easily define the frame-number range characterizing the target speech in each estimated
spectrum series in Y*; thus, the speech segment can be quickly detected. As a result, it
is possible to provide a speech recognition engine with a fast response time of speech
recovery under real-life conditions, and at the same time, with high recognition ability.

According to the present invention as described in Claim 3, it is possible to
extract signal components falling only in the speech segment in the time domain, which
is determined from the estimated spectra corresponding to the target speech, from the
received signals under real-life conditions. Thus, the residual noise can be minimized to
recover target speech with high quality. As a result, input operations by means of
speech recognition in a noisy environment, such as voice commands or input for OA,
for storage management in logistics, and for operating car navigation systems, may be
able to replace the conventional input operations by use of fingers, touch sensors, or
keyboards.

According to the present invention as described in Claim 4, it is possible to
easily define the time interval characterizing the target speech in the recovered signal of
the target speech with the minimal calculation load. As a result, it is possible to provide
a speech recognition engine with a fast response time of speech recovery under real-life
conditions, and at the same time, with high recognition ability.

According to the present invention as described in Claim 5, it is possible to
evaluate the kurtosis of the amplitude distribution of each of the estimated spectrum
series in Y* even in the presence of outliers. Thus, it is possible to unambiguously
assign each estimated spectrum series in Y* either to y*, in which the noise is removed,
or to y, in which the noise remains.

According to the present invention as described in Claim 6, it is possible to
unambiguously assign each estimated spectrum series in Y* either to y*, in which the
noise is removed, or to y, in which the noise remains, with the minimal calculation load.
As a result, it is possible to provide a speech recognition engine with a fast response
time of speech recovery under real-life conditions, and at the same time, with high
recognition ability.

BRIEF DESCRIPTION OF THE DRAWINGS 

FIG. 1 is a block diagram showing a target speech recovering apparatus
employing the method for recovering target speech based on speech segment detection
under a stationary noise according to the first and second embodiments of the present
invention.

FIG. 2 is an explanatory view showing a signal flow in which a recovered
spectrum is generated from the target speech and the noise per the method in FIG. 1.

FIG. 3 is a graph showing the waveform of the recovered signal of the target
speech, which is obtained after performing the inverse Fourier transform of the
recovered spectrum group comprising the estimated spectra Y*.

FIG. 4 is a graph showing an estimated spectrum series in y* in which the
noise is removed.

FIG. 5 is a graph showing an estimated spectrum series in y in which the
noise remains.

FIG. 6 is a graph showing the amplitude distribution of the estimated
spectrum series in y* in which the noise is removed.

FIG. 7 is a graph showing the amplitude distribution of the estimated
spectrum series in y in which the noise remains.

FIG. 8 is a graph showing the total sum of all the estimated spectrum series in
y*.

FIG. 9 is a graph showing the speech segment detection function.

FIG. 10 is a graph showing the waveform of the recovered signal of the target
speech after performing the inverse Fourier transform of the recovered spectrum group,
which is obtained by extracting components falling in the speech segment from the
estimated spectra Y*.

FIG. 11 is a perspective view of the virtual room, where the locations of the
sound sources and microphones are shown as employed in the Examples 1 and 2.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS 

Embodiments of the present invention are described below with reference to
the accompanying drawings to facilitate understanding of the present invention.

As shown in FIG. 1, a target speech recovering apparatus 10, which employs a
method for recovering target speech based on speech segment detection under a
stationary noise according to the first and second embodiments of the present invention,
comprises two sound sources 11 and 12 (one of which is a target speech source and the
other is a noise source, although they are not identified), a first microphone 13 and a
second microphone 14, which are provided at separate locations for receiving mixed
signals transmitted from the two sound sources, a first amplifier 15 and a second
amplifier 16 for amplifying the mixed signals received at the microphones 13 and 14
respectively, a recovering apparatus body 17 for separating the target speech and the
noise from the mixed signals entered through the amplifiers 15 and 16 and outputting
recovered signals of the target speech and the noise, a recovered signal amplifier 18 for
amplifying the recovered signals outputted from the recovering apparatus body 17, and
a loudspeaker 19 for outputting the amplified recovered signals. These elements are
described in detail below.

For the first and second microphones 13 and 14, microphones with a frequency
range wide enough to receive signals over the audible range (10-20000 Hz) may be
used. Here, the first microphone 13 is placed more closely to the sound source 11 than
the second microphone 14 is, and the second microphone 14 is placed more closely to
the sound source 12 than the first microphone 13 is.

For the amplifiers 15 and 16, amplifiers with frequency band characteristics
that allow non-distorted amplification of audible signals may be used.

The recovering apparatus body 17 comprises A/D converters 20 and 21 for
digitizing the mixed signals entered through the amplifiers 15 and 16, respectively.

The recovering apparatus body 17 further comprises a split spectra generating
apparatus 22, equipped with a signal separating arithmetic circuit and a spectrum
splitting arithmetic circuit. The signal separating arithmetic circuit performs the Fourier
transform of the digitized mixed signals from the time domain to the frequency domain,
and decomposes the mixed signals into two separated signals U1 and U2 by means of
the FastICA. Based on transmission path characteristics of the four possible paths from
the two sound sources 11 and 12 to the first and second microphones 13 and 14, the
spectrum splitting arithmetic circuit generates from the separated signal U1 one pair of
split spectra v11 and v12 which were received at the first microphone 13 and the second
microphone 14 respectively, and generates from the separated signal U2 another pair of
split spectra v21 and v22 which were received at the first microphone 13 and the second
microphone 14 respectively.

The recovering apparatus body 17 further comprises an estimated spectra
extracting circuit 23 for extracting estimated spectra Y* of the target speech, wherein
the split spectra v11, v12, v21, and v22 are analyzed by applying criteria based on sound
transmission characteristics that depend on the four different distances between the first
and second microphones 13 and 14 and the sound sources 11 and 12 to assign each split
spectrum to the target speech or to the noise.

The recovering apparatus body 17 further comprises a speech segment
detection circuit 24 for separating the estimated spectra Y* into an estimated spectrum
series group y* in which the noise is removed and an estimated spectrum series group y
in which the noise remains by applying separation judgment criteria based on the
kurtosis of the amplitude distribution of each of the estimated spectrum series in Y*,
and detecting a speech segment in the frame-number domain of a total sum F of all the
estimated spectrum series in y* by applying detection judgment criteria based on a
threshold value β that is determined by the maximum value of F.

The recovering apparatus body 17 further comprises a recovered spectra
extracting circuit 25 for extracting components falling in the speech segment from each
of the estimated spectrum series in Y* to generate a recovered spectrum group.

The recovering apparatus body 17 further comprises a recovered signal
generating circuit 26 for performing the inverse Fourier transform of the recovered
spectrum group from the frequency domain to the time domain to generate the
recovered signal of the target speech.

The split spectra generating apparatus 22, equipped with the signal separating
arithmetic circuit and the spectrum splitting arithmetic circuit, the estimated spectra
extracting circuit 23, the speech segment detection circuit 24, the recovered spectra
extracting circuit 25, and the recovered signal generating circuit 26 may be structured
by loading programs for executing each circuit's functions on, for example, a personal
computer. Also, it is possible to load the programs on a plurality of microcomputers
and form a circuit for collective operation of these microcomputers.

In particular, if the programs are loaded on a personal computer, the entire
recovering apparatus body 17 may be structured by incorporating the A/D converters
20 and 21 into the personal computer.

For the recovered signal amplifier 18, an amplifier that allows analog
conversion and non-distorted amplification of audible signals may be used. A
loudspeaker that allows non-distorted output of audible signals may be used for the
loudspeaker 19.

The method for recovering target speech based on speech segment detection
under a stationary noise according to the first embodiment of the present invention
comprises: the first step of receiving a signal s1(t) from the sound source 11 and a
signal s2(t) from the sound source 12 at the first and second microphones 13 and 14 and
forming mixed signals x1(t) and x2(t) at the first microphone 13 and at the second
microphone 14 respectively, performing the Fourier transform of the mixed signals
x1(t) and x2(t) from the time domain to the frequency domain, and extracting estimated
spectra Y* and Y corresponding to the target speech and the noise by use of the
FastICA, as shown in FIG. 2; the second step of separating the estimated spectra Y* into
an estimated spectrum series group y* in which the noise is removed and an estimated
spectrum series group y in which the noise remains by applying separation judgment
criteria based on the kurtosis of the amplitude distribution of each of the estimated
spectrum series in Y*; the third step of detecting a speech segment and a noise segment
in the frame-number domain of a total sum F of all the estimated spectrum series in y*
by applying detection judgment criteria based on a threshold value β that is determined
by the maximum value of F; and the fourth step of extracting components falling in the
speech segment from each of the estimated spectrum series in Y* to generate a
recovered spectrum group of the target speech, and performing the inverse Fourier
transform of the recovered spectrum group from the frequency domain to the time
domain to generate the recovered signal of the target speech. The above steps are
described in detail below. Here, "t" represents time throughout.

1. First Step

In general, the signal s1(t) from the sound source 11 and the signal s2(t) from
the sound source 12 are assumed to be statistically independent of each other. The
mixed signals x1(t) and x2(t), which are obtained by receiving the signals s1(t) and s2(t)
at the microphones 13 and 14 respectively, are expressed as in Equation (1):

$$x(t) = G(t) * s(t) \tag{1}$$

where s(t) = [s1(t), s2(t)]^T, x(t) = [x1(t), x2(t)]^T, * is a convolution operator, and G(t)
represents transfer functions from the sound sources 11 and 12 to the first and second
microphones 13 and 14.

As in Equation (1), when the signals from the sound sources 11 and 12 are
convoluted, it is difficult to separate the signals s1(t) and s2(t) from the mixed signals
x1(t) and x2(t) in the time domain. Therefore, the mixed signals x1(t) and x2(t) are
divided into short time intervals (frames) and are transformed from the time domain to
the frequency domain for each frame as in Equation (2):

$$x_j(\omega, k) = \sum_t e^{-\sqrt{-1}\,\omega t}\, x_j(t)\, w(t - k\tau) \qquad (j = 1, 2;\ k = 0, 1, \ldots, K-1) \tag{2}$$

where ω (= 0, 2π/M, ..., 2π(M-1)/M) is a normalized frequency, M is the number of
samples in a frame, w(t) is a window function, τ is a frame interval, and K is the
number of frames. For example, the time interval can be about several tens of
milliseconds. In this way, it is also possible to treat the spectra as a group of spectrum
series by laying out the components at each frequency in the order of frames. Moreover,
in the frequency domain, it is possible to treat the recovery problems just like in the case
of instant mixing.
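The frame-by-frame transform of Equation (2) can be sketched with NumPy as follows. This is a minimal illustration rather than the implementation used in the embodiments: the Hamming window is an assumed default (the text only requires some window function w(t)), the function name `frame_spectra` is hypothetical, and `np.fft.fft` measures time from the start of each frame, so its output differs from Equation (2) by a phase factor that depends only on the frequency and the frame number.

```python
import numpy as np

def frame_spectra(x, frame_len, frame_interval, window=None):
    """Return X[m, k]: the spectrum at frequency bin m for frame k.

    frame_len      plays the role of M (samples per frame),
    frame_interval plays the role of the frame interval tau (in samples).
    """
    if window is None:
        window = np.hamming(frame_len)     # assumed window function w(t)
    n_frames = 1 + (len(x) - frame_len) // frame_interval
    X = np.empty((frame_len, n_frames), dtype=complex)
    for k in range(n_frames):
        start = k * frame_interval
        X[:, k] = np.fft.fft(x[start:start + frame_len] * window)
    return X

# Usage with a hypothetical constant signal and a rectangular window.
X = frame_spectra(np.ones(16), frame_len=8, frame_interval=4,
                  window=np.ones(8))
series_at_bin_0 = X[0, :]   # one spectrum series: bin 0 laid out over frames
```

Each row of the returned array is one spectrum series in the sense used above, i.e. the components at one frequency laid out in the order of frames.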

In this case, mixed signal spectra x(ω,k) and corresponding spectra of the
signals s1(t) and s2(t) are related to each other in the frequency domain as in Equation
(3):

$$x(\omega, k) = G(\omega)\, s(\omega, k) \tag{3}$$

where s(ω,k) is the discrete Fourier transform of a windowed s(t), and G(ω) is a
complex number matrix that is the discrete Fourier transform of G(t).
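The step from the convolution in Equation (1) to the per-frequency product in Equation (3) can be checked numerically for a single source and a single microphone. The sketch below uses circular convolution, for which the identity holds exactly; for real room responses that are longer than a frame the relation is only approximate. The impulse response, the signal and the function name are hypothetical.

```python
import numpy as np

def circular_convolve(g, s):
    """Circular convolution of impulse response g with signal s (len(g) <= len(s))."""
    M = len(s)
    return np.array([sum(g[m] * s[(t - m) % M] for m in range(len(g)))
                     for t in range(M)])

g = np.array([1.0, 0.5, 0.25])                  # hypothetical impulse response
s = np.array([0.2, -1.0, 0.7, 0.0, 0.3, -0.4, 0.9, 0.1])
x = circular_convolve(g, s)                     # time-domain mixing

lhs = np.fft.fft(x)                             # spectrum of the observed signal
rhs = np.fft.fft(g, len(s)) * np.fft.fft(s)     # per-bin product G(w) s(w)
```

At every frequency bin the two sides agree, which is why the separation can be carried out independently at each frequency as in the instant-mixing case.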

Since the signal spectra s1(ω,k) and s2(ω,k) are inherently independent of each
other, if mutually independent separated signal spectra U1(ω,k) and U2(ω,k) are
calculated from the mixed signal spectra x(ω,k) by use of the FastICA, these separated
spectra will correspond to the signal spectra s1(ω,k) and s2(ω,k) respectively. In other
words, by obtaining a separation matrix H(ω)Q(ω) with which the relationship
expressed in Equation (4) is valid between the mixed signal spectra x(ω,k) and the
separated signal spectra U1(ω,k) and U2(ω,k), it becomes possible to determine the
mutually independent separated signal spectra U1(ω,k) and U2(ω,k) from the mixed
signal spectra x(ω,k).

$$u(\omega, k) = H(\omega)\, Q(\omega)\, x(\omega, k) \tag{4}$$

where u(ω,k) = [U1(ω,k), U2(ω,k)]^T.

Incidentally, in the frequency domain, amplitude ambiguity and permutation
occur at individual frequencies as in Equation (5):

$$H(\omega)\, Q(\omega)\, G(\omega) = P\, D(\omega) \tag{5}$$

where H(ω) is defined later in Equation (10), Q(ω) is a whitening matrix, P is a matrix
representing permutation with only one element in each row and each column being 1
and all the other elements being 0, and D(ω) = diag[d1(ω), d2(ω)] is a diagonal matrix
representing the amplitude ambiguity. Therefore, these problems need to be addressed
in order to obtain meaningful separated signals for recovering.

In the frequency domain, on the assumption that its real and imaginary parts
have the mean 0 and the same variance and are uncorrelated, each sound source
spectrum s_i(ω,k) (i = 1, 2) is formulated as follows.

First, at a frequency ω, a separation weight h_n(ω) (n = 1, 2) is obtained according
to the FastICA algorithm, which is a modification of the Independent Component
Analysis algorithm, as shown in Equations (6) and (7):

$$h_n^{+}(\omega) = \frac{1}{K} \sum_{k=0}^{K-1} \Big\{ x(\omega,k)\, \bar{U}_n(\omega,k)\, f\big(|U_n(\omega,k)|^2\big) - \Big[ f\big(|U_n(\omega,k)|^2\big) + |U_n(\omega,k)|^2 f'\big(|U_n(\omega,k)|^2\big) \Big] h_n(\omega) \Big\} \tag{6}$$

$$h_n(\omega) = h_n^{+}(\omega) \,/\, \| h_n^{+}(\omega) \| \tag{7}$$

where f(|U_n(ω,k)|^2) is a nonlinear function, f'(|U_n(ω,k)|^2) is the derivative of
f(|U_n(ω,k)|^2), the overbar is a conjugate sign, and K is the number of frames.

This algorithm is repeated until a convergence condition CC shown in Equation
(8):

$$CC = \bar{h}_n^{T}(\omega)\, h_n^{+}(\omega) \cong 1 \tag{8}$$

is satisfied (for example, CC becomes greater than or equal to 0.9999). Further, h2(ω) is
orthogonalized with h1(ω) as in Equation (9):

$$h_2(\omega) = h_2(\omega) - h_1(\omega)\, \bar{h}_1^{T}(\omega)\, h_2(\omega) \tag{9}$$

and normalized as in Equation (7) again.
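The iteration of Equations (6) to (9) at a single frequency can be sketched with NumPy as follows. This is a sketch under stated assumptions, not the patented implementation: the nonlinear function is taken as f(y) = 1/(a + y), a choice commonly used for complex-valued FastICA, because the passage does not specify f; the input `z` is assumed to be the already whitened mixed spectra at one frequency; convergence is judged by the magnitude of the inner product of successive normalized weights; and the function names, the constant `a`, the iteration limit and the random initialization are all hypothetical.

```python
import numpy as np

def fastica_update(h, z, a=0.1):
    """One update of a weight h following Equations (6) and (7).

    z: whitened mixed spectra at one frequency, shape (2, K).
    Assumed nonlinearity: f(y) = 1 / (a + y), f'(y) = -1 / (a + y)^2.
    """
    u = np.conj(h) @ z                    # separated series for this weight
    p = np.abs(u) ** 2
    f = 1.0 / (a + p)
    df = -1.0 / (a + p) ** 2
    # Equation (6): frame average of z * conj(u) * f minus average of (f + p f') times h.
    h_plus = (z * np.conj(u) * f).mean(axis=1) - (f + p * df).mean() * h
    return h_plus / np.linalg.norm(h_plus)          # Equation (7)

def separate_one_frequency(z, max_iter=200, cc_min=0.9999, seed=0):
    """Return a 2x2 matrix whose rows are the conjugated weights h_1, h_2."""
    rng = np.random.default_rng(seed)
    weights = []
    for n in range(2):
        h = rng.standard_normal(2) + 1j * rng.standard_normal(2)
        h = h / np.linalg.norm(h)
        for _ in range(max_iter):
            h_new = fastica_update(h, z)
            if n == 1:
                # Equation (9): remove the component along h_1, then renormalize.
                h_new = h_new - weights[0] * np.vdot(weights[0], h_new)
                h_new = h_new / np.linalg.norm(h_new)
            cc = abs(np.vdot(h, h_new))   # convergence measure in the spirit of Equation (8)
            h = h_new
            if cc >= cc_min:
                break
        weights.append(h)
    return np.conj(np.array(weights))

# Usage with hypothetical data: two super-Gaussian complex series over 400 frames.
rng = np.random.default_rng(1)
z = rng.laplace(size=(2, 400)) * np.exp(2j * np.pi * rng.random((2, 400)))
H = separate_one_frequency(z)
```

Because the second weight is orthogonalized against the first and both are normalized, the returned matrix is unitary regardless of how many iterations were needed.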

The aforesaid FastICA algorithm is carried out for each frequency ω. The
obtained separation weights h_n(ω) (n = 1, 2) determine H(ω) as in Equation (10):

$$H(\omega) = \begin{bmatrix} \bar{h}_1^{T}(\omega) \\ \bar{h}_2^{T}(\omega) \end{bmatrix} \tag{10}$$

which is used in Equation (4) to calculate the separated signal spectra u(ω,k) =
[U1(ω,k), U2(ω,k)]^T at each frequency. As shown in FIG. 2, two nodes where the
separated signal spectra U1(ω,k) and U2(ω,k) are outputted are referred to as 1 and 2.

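The split spectra v1(ω,k) = [v11(ω,k), v12(ω,k)]^T and v2(ω,k) = [v21(ω,k), v22(ω,k)]^T
are defined as spectra generated as a pair (1 and 2) at nodes n (= 1, 2) from the separated
signal spectra U1(ω,k) and U2(ω,k) respectively, as shown in Equations (11) and (12):

$$\begin{bmatrix} v_{11}(\omega,k) \\ v_{12}(\omega,k) \end{bmatrix} = \big(H(\omega)\, Q(\omega)\big)^{-1} \begin{bmatrix} U_1(\omega,k) \\ 0 \end{bmatrix} \tag{11}$$

$$\begin{bmatrix} v_{21}(\omega,k) \\ v_{22}(\omega,k) \end{bmatrix} = \big(H(\omega)\, Q(\omega)\big)^{-1} \begin{bmatrix} 0 \\ U_2(\omega,k) \end{bmatrix} \tag{12}$$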

If the permutation is not occurring but the amplitude ambiguity exists, the
separated signal spectra U_n(ω,k) are outputted as in Equation (13):

$$\begin{bmatrix} U_1(\omega,k) \\ U_2(\omega,k) \end{bmatrix} = \begin{bmatrix} d_1(\omega)\, s_1(\omega,k) \\ d_2(\omega)\, s_2(\omega,k) \end{bmatrix} \tag{13}$$

Then, the split spectra for the above separated signal spectra U_n(ω,k) are generated as in
Equations (14) and (15):

$$\begin{bmatrix} v_{11}(\omega,k) \\ v_{12}(\omega,k) \end{bmatrix} = \begin{bmatrix} g_{11}(\omega)\, s_1(\omega,k) \\ g_{21}(\omega)\, s_1(\omega,k) \end{bmatrix} \tag{14}$$

$$\begin{bmatrix} v_{21}(\omega,k) \\ v_{22}(\omega,k) \end{bmatrix} = \begin{bmatrix} g_{12}(\omega)\, s_2(\omega,k) \\ g_{22}(\omega)\, s_2(\omega,k) \end{bmatrix} \tag{15}$$

which show that the split spectra at each node are e^qiressed as the product of the 
10 spectrum si((aM) and Hhe transfer function, or the product of the spectrum $2(cd40 and 
the transfer function. Note here that gii(a>) is a transfer function from the sound source 
II to the first microphone 13, g2i(a>) is a tran^r function firan the sound source 11 to 
the second microphone 14, gizitoi) is a transfer function Scorn &e sound source 12 to die 
first microphone 1 3, and g22(<») is a transfo- function from the sound source 12 to the 
IS second microphone 14. 

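If there are both permutation and amplitude ambiguity, the separated signal
spectra U_n(ω,k) are expressed as in Equation (16):

$$ \begin{bmatrix} U_1(\omega,k) \\ U_2(\omega,k) \end{bmatrix} = \begin{bmatrix} d_1(\omega)\, s_2(\omega,k) \\ d_2(\omega)\, s_1(\omega,k) \end{bmatrix} \tag{16} $$

and the split spectra at the nodes 1 and 2 are generated as in Equations (17) and (18):

$$ \begin{bmatrix} v_{11}(\omega,k) \\ v_{12}(\omega,k) \end{bmatrix} = \begin{bmatrix} g_{12}(\omega)\, s_2(\omega,k) \\ g_{22}(\omega)\, s_2(\omega,k) \end{bmatrix} \tag{17} $$

$$ \begin{bmatrix} v_{21}(\omega,k) \\ v_{22}(\omega,k) \end{bmatrix} = \begin{bmatrix} g_{11}(\omega)\, s_1(\omega,k) \\ g_{21}(\omega)\, s_1(\omega,k) \end{bmatrix} \tag{18} $$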




WO 2005/029463 PCT/JP2004/012899



In the above, the spectrum v_11(ω,k) generated at the node 1 represents the signal
spectrum s_2(ω,k) transmitted from the sound source 12 and observed at the first
microphone 13, the spectrum v_12(ω,k) generated at the node 1 represents the signal
spectrum s_2(ω,k) transmitted from the sound source 12 and observed at the second
microphone 14, the spectrum v_21(ω,k) generated at the node 2 represents the signal
spectrum s_1(ω,k) transmitted from the sound source 11 and observed at the first
microphone 13, and the spectrum v_22(ω,k) generated at the node 2 represents the signal
spectrum s_1(ω,k) transmitted from the sound source 11 and observed at the second
microphone 14.

The four spectra v_11(ω,k), v_12(ω,k), v_21(ω,k) and v_22(ω,k) shown in FIG. 2 can
be separated into two groups, each consisting of two split spectra. One of the groups
corresponds to one sound source, and the other corresponds to the other sound source.
For example, in the absence of permutation, v_11(ω,k) and v_12(ω,k) correspond to one
sound source; and in the presence of permutation, v_21(ω,k) and v_22(ω,k) correspond to
the one sound source. Due to sound transmission characteristics, for example, sound
intensities, that depend on the four different distances between the first and second
microphones and the two sound sources, spectral intensities of the split spectra v_11, v_12,
v_21, and v_22 differ from one another. Therefore, if distinctive distances are provided
between the microphones and the sound sources, it is possible to determine which
microphone received which sound source's signal. That is, it is possible to identify the
sound source for each of the split spectra v_11, v_12, v_21, and v_22.

Here, it is assumed that the sound source 11 is closer to the first microphone
13 than to the second microphone 14 and that the sound source 12 is closer to the
second microphone 14 than to the first microphone 13. In this case, comparison of
transmission characteristics between the two possible paths from the sound source 11 to
the microphones 13 and 14 provides a gain comparison as in Equation (19):

$$ |g_{11}(\omega)| > |g_{21}(\omega)| \tag{19} $$

Similarly, by comparing transmission characteristics between the two possible paths
from the sound source 12 to the microphones 13 and 14, a gain comparison is obtained
as in Equation (20):

$$ |g_{12}(\omega)| < |g_{22}(\omega)| \tag{20} $$

In this case, when Equations (14) and (15) or Equations (17) and (18) are used with the
gain comparison in Equations (19) and (20), if there is no permutation, calculation of
the difference D_1 between the spectra v_11 and v_12 and the difference D_2 between the
spectra v_21 and v_22 shows that D_1 at the node 1 is positive and D_2 at the node 2 is
negative. On the other hand, if there is permutation, the similar analysis shows that D_1
at the node 1 is negative and D_2 at the node 2 is positive.

In other words, the occurrence of permutation is recognized by examining the
differences D_1 and D_2 between respective split spectra: if D_1 at the node 1 is positive
and D_2 at the node 2 is negative, the permutation is considered not occurring; and if D_1
at the node 1 is negative and D_2 at the node 2 is positive, the permutation is considered
occurring.

In case the difference D_1 is calculated as a difference between absolute values
of the spectra v_11 and v_12, and the difference D_2 is calculated as a difference between
absolute values of the spectra v_21 and v_22, the differences D_1 and D_2 are expressed as in
Equations (21) and (22), respectively:

$$ D_1 = |v_{11}(\omega,k)| - |v_{12}(\omega,k)| \tag{21} $$

$$ D_2 = |v_{21}(\omega,k)| - |v_{22}(\omega,k)| \tag{22} $$
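The sign pattern of D_1 and D_2 follows directly from the gain comparisons. As a short worked check, assuming that the source spectra s_1(ω,k) and s_2(ω,k) are non-zero in the frame considered (an assumption added here only for the illustration), substituting Equations (14) and (15), or Equations (17) and (18), into Equations (21) and (22) and applying Equations (19) and (20) gives:

```latex
\begin{aligned}
\text{no permutation:}\quad
D_1 &= |g_{11}(\omega)|\,|s_1(\omega,k)| - |g_{21}(\omega)|\,|s_1(\omega,k)|
     = \bigl(|g_{11}(\omega)| - |g_{21}(\omega)|\bigr)\,|s_1(\omega,k)| > 0,\\
D_2 &= \bigl(|g_{12}(\omega)| - |g_{22}(\omega)|\bigr)\,|s_2(\omega,k)| < 0;\\
\text{permutation:}\quad
D_1 &= \bigl(|g_{12}(\omega)| - |g_{22}(\omega)|\bigr)\,|s_2(\omega,k)| < 0,\\
D_2 &= \bigl(|g_{11}(\omega)| - |g_{21}(\omega)|\bigr)\,|s_1(\omega,k)| > 0.
\end{aligned}
```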

If there is no permutation, v_11(ω,k) is selected as a spectrum y_1(ω,k) of the
signal from the one sound source that is closer to the first microphone 13 than to the
second microphone 14. This is because the spectral intensity of v_11(ω,k) observed at the
first microphone 13 is greater than the spectral intensity of v_12(ω,k) observed at the
second microphone 14, and v_11(ω,k) is less subject to the background noise than
v_12(ω,k). Also, if there is permutation, v_21(ω,k) is selected as the spectrum y_1(ω,k) for
the one sound source. Therefore, the spectrum y_1(ω,k) for the one sound source is
expressed as in Equation (23):

$$ y_1(\omega,k) = \begin{cases} v_{11}(\omega,k) & \text{if } D_1 > 0,\ D_2 < 0 \\ v_{21}(\omega,k) & \text{if } D_1 < 0,\ D_2 > 0 \end{cases} \tag{23} $$

Similarly for a spectrum y_2(ω,k) for the other sound source, the spectrum
v_22(ω,k) is selected if there is no permutation, and the spectrum v_12(ω,k) is selected if
there is permutation as in Equation (24):

$$ y_2(\omega,k) = \begin{cases} v_{12}(\omega,k) & \text{if } D_1 < 0,\ D_2 > 0 \\ v_{22}(\omega,k) & \text{if } D_1 > 0,\ D_2 < 0 \end{cases} \tag{24} $$

The permutation occurrence is determined by using Equations (21) and (22).

The FastICA method is characterized by its capability of sequentially
separating signals from the mixed signals in descending order of non-Gaussianity.
Speech generally has higher non-Gaussianity than noises. Thus, if observed sounds
consist of the target speech (i.e., speaker's speech) and the noise, it is highly probable
that a split spectrum corresponding to the speaker's speech is in the separated signal U_1,
which is the first output of this method. Thus, if the one sound source is the speaker,
the permutation occurrence is highly unlikely; and if the other sound source is the
speaker, the permutation occurrence is highly likely.

Therefore, while the spectra y_1 and y_2 are generated, the number of
permutation occurrences N^- and the number of non-occurrences N^+ over all the
frequencies are counted, and the estimated spectra Y* and Y are determined by using
the criteria given as:

(a) if the count N^+ is greater than the count N^-, select the spectrum y_1 as the
estimated spectrum Y* and select the spectrum y_2 as the estimated spectrum Y; or

(b) if the count N^- is greater than the count N^+, select the spectrum y_2 as the
estimated spectrum Y* and select the spectrum y_1 as the estimated spectrum Y.
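The permutation test and the counting rule of this step can be sketched as follows. This is a minimal Python illustration, not the patented implementation: the function names are hypothetical, the gains and source values are made-up numbers satisfying Equations (19) and (20), and the handling of bins where D_1 and D_2 have the same sign (returning `None`) and of equal counts is an assumption, since the text does not cover those cases.

```python
def resolve_bin(v11, v12, v21, v22):
    # Returns (y1, y2, permuted) for one (frequency, frame) bin.
    d1 = abs(v11) - abs(v12)            # Equation (21)
    d2 = abs(v21) - abs(v22)            # Equation (22)
    if d1 > 0 and d2 < 0:               # no permutation: Equations (23), (24)
        return v11, v22, False
    if d1 < 0 and d2 > 0:               # permutation: Equations (23), (24)
        return v21, v12, True
    return None                         # case not covered by the text (assumption: skip)

def choose_target(flags):
    # flags: one permutation flag per frequency (True = permutation occurred).
    n_plus = sum(1 for p in flags if p is False)    # non-occurrences
    n_minus = sum(1 for p in flags if p is True)    # occurrences
    if n_plus > n_minus:
        return "y1"                     # criterion (a): y1 is taken as Y*
    if n_minus > n_plus:
        return "y2"                     # criterion (b): y2 is taken as Y*
    return None                         # tie: not specified in the text

# Hypothetical gains obeying |g11| > |g21| and |g12| < |g22|.
g11, g21, g12, g22 = 1.0, 0.4, 0.3, 0.9
s1, s2 = 2 + 1j, -0.5 + 0.7j
plain = resolve_bin(g11 * s1, g21 * s1, g12 * s2, g22 * s2)      # Equations (14), (15)
swapped = resolve_bin(g12 * s2, g22 * s2, g11 * s1, g21 * s1)    # Equations (17), (18)
```

Both calls return the same pair (y_1, y_2), which illustrates that the selection undoes the permutation; only the flag differs.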



2. Second Step

FIG. 3 shows the waveform of the target speech ("Tokyo"), which was
obtained after the inverse transform of the recovered spectrum group comprising the
estimated spectra as obtained above. It can be seen in this figure that the noise signal
still remains in the recovered signal of the target speech.

Therefore, the estimated spectrum series at each frequency was investigated.
It was found that the noise had been removed from some of the estimated spectrum
series in Y*, an example of which is shown in FIG. 4, while the noise still remains in the
other estimated spectrum series in Y*, an example of which is shown in FIG. 5. In the
estimated spectrum series in which the noise has been removed, the amplitude is large in
the speech segment, and is extremely small in the non-speech segment, clearly defining the
start and end points of the speech segment. Thus, it is expected that by using only the
estimated spectrum series in which the noise has been removed, the speech segment
can be obtained accurately.

FIG. 6 shows the amplitude distribution of the estimated spectrum series in
FIG. 4; and FIG. 7 shows the amplitude distribution of the estimated spectrum series in
FIG. 5. It can be seen from these figures that the amplitude distribution of the estimated
spectrum series in which the noise has been removed has a high kurtosis; and the
amplitude distribution of the estimated spectrum series in which the noise remains has a
low kurtosis. Therefore, by applying separation judgment criteria based on the kurtosis
of the amplitude distribution of each of the estimated spectrum series in Y*, it is
possible to separate the estimated spectra Y* into an estimated spectrum series group
y* in which the noise has been removed and an estimated spectrum series group y in
which the noise remains.

In order to quantitatively evaluate kurtosis values, entropy E of an amplitude
distribution may be employed. The entropy E represents uncertainty of a main
amplitude value. Thus, when the kurtosis is high, the entropy is low; and when the
kurtosis is low, the entropy is high. Therefore, by use of a predetermined threshold
value α, the separation judgment criteria are given as:

(1) if the entropy E of an estimated spectrum series in Y* is less than the threshold
value α, the estimated spectrum series in Y* is assigned to y*; and

(2) if the entropy E of an estimated spectrum series in Y* is greater than or equal to
the threshold value α, the estimated spectrum series in Y* is assigned to y.

The entropy is defined as in the following Equation (25):

$$ E(\omega) = -\sum_{n=1}^{N} p_\omega(l_n) \log p_\omega(l_n) \tag{25} $$

where p_ω(l_n) (n = 1, 2, ..., N) is a probability, which is equivalent to q_ω(l_n) (n = 1, 2,
..., N) normalized as in the following Equation (26). Here, l_n indicates the n-th interval
when the amplitude distribution range is divided into N equal intervals for the real part
of an estimated spectrum series at each frequency in Y*, and q_ω(l_n) is a frequency of
occurrence within the n-th interval.

$$ p_\omega(l_n) = q_\omega(l_n) \Big/ \sum_{n=1}^{N} q_\omega(l_n) \tag{26} $$
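The entropy criterion can be illustrated with a short Python sketch. It builds a histogram of the real parts over N equal intervals, normalizes the counts, and evaluates the entropy. The natural logarithm, the choice N = 10, the toy input series, and the function names are all assumptions made for this illustration; the text does not state the logarithm base or the value of N.

```python
import math

def amplitude_entropy(series, n_bins):
    # Entropy of the amplitude distribution of the real part of one series.
    values = [z.real for z in series]
    lo, hi = min(values), max(values)
    if hi == lo:
        return 0.0                      # a constant series has no uncertainty
    width = (hi - lo) / n_bins
    counts = [0] * n_bins               # occurrence count per interval
    for v in values:
        counts[min(int((v - lo) / width), n_bins - 1)] += 1
    total = len(values)
    # Normalize counts to probabilities, then sum -p log p (natural log assumed).
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

def separate_by_entropy(Y_star, alpha, n_bins):
    # Y_star maps a frequency index to its estimated spectrum series.
    removed, remaining = {}, {}
    for omega, series in Y_star.items():
        target = removed if amplitude_entropy(series, n_bins) < alpha else remaining
        target[omega] = series
    return removed, remaining

# Toy data: a sharply peaked series and a flat (uniformly spread) series.
peaked = [0j] * 98 + [1 + 0j, -1 + 0j]
flat = [complex(i / 100, 0) for i in range(100)]
y_star, y_rest = separate_by_entropy({0: peaked, 1: flat}, alpha=2.0, n_bins=10)
```

The peaked series has low entropy and lands in the noise-removed group, while the flat series has entropy close to log N and lands in the other group.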



3. Third Step

Since the frequency components of a speech signal vary with time, the
frame-number range characterizing speech varies from one estimated spectrum series to
another in y*. By taking a summation of all the estimated
spectrum series in y* at each frame number, the frame-number range characterizing the
speech can be clearly defined. An example of the total sum F of all the estimated
spectrum series in y* is shown in FIG. 8, where each amplitude value is normalized by
the maximum value (which is 1 in FIG. 8). By specifying a threshold value β depending
on the maximum value of F, the frame number range where F is greater than β may be
defined as the speech segment, and the frame number range where F is less than or
equal to β may be defined as the noise segment. Therefore, by applying the detection
judgment criteria based on the amplitude distribution in FIG. 8 and the threshold value
β, a speech segment detection function F*(k) is obtained, where F*(k) is a two-valued
function which is 1 when F > β, and is 0 when F ≤ β.
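The detection rule of this step can be sketched in Python as below. The sketch sums the magnitudes of the noise-removed series per frame, normalizes by the maximum, and thresholds the result. Summing magnitudes (rather than, say, real parts) is an assumption, as are the function name and the toy input values.

```python
def detect_speech_frames(noise_free_series, beta):
    # noise_free_series: list of series (one per frequency), all of equal length.
    n_frames = len(noise_free_series[0])
    # Total sum F over all series at each frame number (magnitudes assumed).
    total = [sum(abs(s[k]) for s in noise_free_series) for k in range(n_frames)]
    peak = max(total)
    if peak == 0:
        return [0] * n_frames           # nothing to detect
    # Two-valued detection function: 1 where normalized F exceeds beta, else 0.
    return [1 if t / peak > beta else 0 for t in total]

F_star = detect_speech_frames(
    [[0.01, 0.02, 1.0, 0.8, 0.01], [0.0, 0.01, 0.5, 0.9, 0.02]], beta=0.08)
```

With these toy values only the third and fourth frames exceed the threshold, so they form the detected speech segment.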

4. Fourth Step

By multiplying each estimated spectrum series in Y* by the speech segment
detection function F*(k), it is possible to extract only the components falling in the
speech segment from the estimated spectrum series. Thereafter, the recovered spectrum
group {Z(ω, k) | k = 0, 1, ..., K-1} can be generated from all the estimated spectrum
series in Y*, each having non-zero components only in the speech segment. The
recovered signal of the target speech Z(t) is thus obtained by performing the inverse
Fourier transform of the recovered spectrum group {Z(ω, k) | k = 0, 1, ..., K-1} for
each frame back to the time domain, and then taking the summation over all the frames
as in Equation (27):

$$ Z(t) = \frac{1}{2\pi} \frac{1}{W(t)} \sum_{k} \sum_{\omega} e^{j\omega(t - k\tau)} Z(\omega,k), \qquad W(t) = \sum_{k} w(t - k\tau) \tag{27} $$
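The masking part of this step amounts to a frame-wise multiplication, which the following Python sketch illustrates. The dictionary layout (frequency index mapped to a series over frames), the function name, and the sample values are assumptions made only for the example; the inverse transform itself is not shown.

```python
def mask_spectra(Y_star, F_star):
    # Multiply every estimated spectrum series by the two-valued detection
    # function, so that only frames inside the speech segment stay non-zero.
    return {omega: [z * flag for z, flag in zip(series, F_star)]
            for omega, series in Y_star.items()}

# Toy example: two frequencies, three frames, speech detected in frame 1 only.
Z = mask_spectra({0: [1 + 1j, 2 + 0j, 0.5j], 1: [0.1 + 0j, 3j, 1 + 0j]}, [0, 1, 0])
```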

FIG. 10 shows the recovered signal of the target speech after the inverse
Fourier transform of the recovered spectrum group, which is obtained by multiplying
each spectrum series in Y* by the speech segment detection function. It is clear upon
comparing FIGs. 3 and 10 that there is no noise remaining in the recovered target
speech in FIG. 10 unlike the recovered target speech in FIG. 3.

The method for recovering target speech based on speech segment detection
under a stationary noise according to the second embodiment of the present invention
comprises: the first step of receiving a signal s_1(t) from the sound source 11 and a
signal s_2(t) from the sound source 12 (one of which is a target speech source and the
other is a noise source) at the first and second microphones 13 and 14 and forming
mixed signals x_1(t) and x_2(t) at the first microphone 13 and at the second microphone
14 respectively, performing the Fourier transform of the mixed signals x_1(t) and x_2(t)
from the time domain to the frequency domain, and extracting the estimated spectra Y*
and Y corresponding to the target speech and the noise by use of the FastICA, as
shown in FIG. 2; the second step of separating the estimated spectra Y* into an
estimated spectrum series group y* in which the noise is removed and an estimated
spectrum series group y in which the noise remains by applying separation judgment
criteria based on the kurtosis of the amplitude distribution of each of the estimated
spectrum series in Y*; the third step of detecting a speech segment and a noise segment
in the time domain of a total sum F of all the estimated spectrum series in y* by
applying detection judgment criteria based on a threshold value β that is determined by
the maximum value of F; and the fourth step of performing the inverse Fourier
transform of the estimated spectra Y* from the frequency domain to the time domain to
generate a recovered signal of the target speech and extracting components falling in
the speech segment from the recovered signal of the target speech to recover the target
speech.

The differences in method between the first and second embodiments are in
the third and fourth steps. In the second embodiment, the speech segment is obtained in
the time domain, and the target speech is recovered by extracting the components
falling in the speech segment from the recovered signal of the target speech in the time
domain. Therefore, only the third and fourth steps are explained below.

The relationship between the frame number k and the sampling time t is
expressed as: τ(k-1) < t ≤ τk, where τ is the frame interval. Thus, k = ⌈t/τ⌉ holds,
where ⌈t/τ⌉ is a ceiling symbol indicating the smallest integer among all the integers
greater than or equal to t/τ, and a speech segment detection function in the time domain
F*(t) can be defined as: F*(t) = 1 in the range where F*(⌈t/τ⌉) = 1; and F*(t) = 0 in the
range where F*(⌈t/τ⌉) = 0. Therefore, in the third step in the second embodiment, the speech
segment is defined as the range in the time domain where F*(⌈t/τ⌉) = 1 holds; and the
noise segment is defined as the range in the time domain where F*(⌈t/τ⌉) = 0 holds.

In the fourth step of the second embodiment, the recovered signal of the target
speech, which is obtained after the inverse Fourier transform of the estimated spectra
Y* from the frequency domain to the time domain, is multiplied by F*(t), which is the
speech segment detection function in the time domain, to extract the target speech
signal.

The resultant target speech signal is amplified by the recovered signal amplifier 18 and
inputted to the loudspeaker 19.
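The time-domain variant of the detection function can be sketched in Python as follows. The frame values are stored so that `F_star[k - 1]` holds F*(k) for k = 1, ..., K, in line with the relation τ(k-1) < t ≤ τk; clamping the frame number to the available range, the function names, and the toy signal are assumptions made for the illustration.

```python
import math

def time_domain_detection(F_star, tau, t):
    # F_star[k - 1] holds the frame value F*(k) for k = 1, ..., K.
    k = math.ceil(t / tau)                  # frame number containing time t
    k = min(max(k, 1), len(F_star))         # clamp to available frames (assumption)
    return F_star[k - 1]

def extract_target(signal, times, F_star, tau):
    # Multiply the recovered time-domain signal by F*(t), sample by sample.
    return [z * time_domain_detection(F_star, tau, t) for z, t in zip(signal, times)]

# Toy example: 16 samples at times 1..16, frame interval of 4 time units.
times = list(range(1, 17))
out = extract_target([1.0] * 16, times, [0, 1, 1, 0], tau=4)
```

Here the samples at times 5 to 12 fall into the second and third frames and are kept, while the rest are set to zero.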

(A) Example 1 

Experiments were conducted in a virtual room with 10 m length, 10 m width,
and 10 m height. Microphones 1 and 2 and sound sources 1 and 2 were placed in the
room as in FIG. 11. The mixed signals received at the microphones 1 and 2 were
analyzed by use of the FastICA, and a noise was removed to recover the target speech.
The detection accuracy of the speech segment was evaluated.

The distance between the microphones 1 and 2 was 0.5 m; the distance
between the two sound sources 1 and 2 was 0.5 m; the microphones were placed 1 m
above the floor level; the two sound sources were placed 0.5 m above the floor level;
the distance between the microphone 1 and the sound source 1 was 0.5 m; and the
distance between the microphone 2 and the sound source 2 was 0.5 m. The FastICA was
carried out by employing the method described in "Permutation Correction and
Speech Extraction Based on Split Spectrum through FastICA" by H. Gotanda, K.
Nobu, T. Koya, K. Kaneda, and T. Ishibashi, Proc. of International Symposium on
Independent Component Analysis and Blind Signal Separation, April 1, 2003, pp. 379-
384. At the sound source 1, each of two speakers (one male and one female) was placed
and spoke five different words (zairyo, iyoiyo, urayamasii, omosiroi, and guai),
emitting a total of ten different speech patterns. At the sound source 2, five different
stationary noises (f16 noise, volvo noise, white noise, pink noise, and tank noise)
selected from the Noisex-92 Database (http://spib.rice.edu/spib) were emitted. From the
above, a total of 50 different mixed signals were generated.

The speech segment detection function F*(k) is two-valued depending on the
total sum F with respect to the threshold value β, and the total sum F is determined
from the estimated spectrum series group y* which is separated from the estimated
spectra Y* according to the threshold value α; thus, the speech segment detection
accuracy depends on α and β. Investigation was made to determine optimal values for α
and β. The optimal values for α were found to be 1.8 - 2.3; and the optimal values for β
were found to be 0.05 - 0.15. The values of α = 2.0 and β = 0.08 were selected.

The start and end points of the speech segment were obtained according to the
present method. Also, a visual inspection on the waveform of the target speech signal
recovered from the estimated spectra Y* was carried out to visually determine the start
and end points of the speech segment. The comparison between the two methods
revealed that the start point of the speech segment determined according to the present
method was -2.71 msec (with a standard deviation of 13.49 msec) with respect to the
start point determined by the visual inspection; and the end point of the speech segment
determined according to the present method was -4.96 msec (with a standard deviation
of 26.07 msec) with respect to the end point determined by the visual inspection.
Therefore, the present method had a tendency of detecting the speech segment earlier
than the visual inspection. Nonetheless, the difference in the speech segment between
the two methods was very small, and the present method detected the speech segment
with reasonable accuracy.

(B) Example 2

At the sound source 2, five different non-stationary noises (office, restaurant,
classical, station, and street) selected from NTT Noise Database (Ambient Noise
Database for Telephonometry, NTT Advanced Technology Inc., 1996) were emitted.
Experiments were conducted with the same conditions as in Example 1.

The results showed that the start point of the speech segment determined
according to the present method was -2.36 msec (with a standard deviation of
14.12 msec) with respect to the start point determined by the visual inspection; and the
end point of the speech segment determined according to the present method was
-13.40 msec (with a standard deviation of 44.12 msec) with respect to the end point
determined by the visual inspection. Therefore, the present method is capable of
detecting the speech segment with reasonable accuracy, functioning almost as well as
the visual inspection even for the case of a non-stationary noise.

While the invention has been so described, the present invention is not limited
to the aforesaid embodiments and can be modified variously without departing from the
spirit and scope of the invention, and may be applied to cases in which the method for
recovering target speech based on speech segment detection under a stationary noise
according to the present invention is structured by combining part or entirety of each of
the aforesaid embodiments and/or its modifications.

For example, in the present method, the FastICA is employed in order to
extract the estimated spectra Y* and Y corresponding to the target speech and the noise
respectively, but the extraction method does not have to be limited to this method. It is
possible to extract the estimated spectra Y* and Y by using the ICA, resolving the
scaling ambiguity based on the sound transmission characteristics that depend on the
four different paths between the two microphones and the sound sources, and resolving
the permutation problem based on the similarity of envelope curves of spectra at
individual frequencies.